Handling and analysis of the Enron Email Dataset - Part 1

The class definitions

EnronEmailParser class
- Parser for the emails included in the Enron Email Dataset.
- This particular implementation treats all recipients (to, cc and bcc) as the same type.
EnronEmailDataset class
- Data handler for the Enron Email Dataset
- It relies on the EnronEmailParser class to do the actual email parsing.
- It uses pandas dataframes as the data storage objects.
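
As a rough illustration of what such a parser has to do, here is a minimal sketch built on the standard library `email` module. It is not the actual EnronEmailParser code: the function name `parse_email` and the sample message are assumptions made for this example (the date and sender are borrowed from the first row of the emails table shown later, and the recipient addresses are made up).

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical sample message; the recipient addresses are invented.
RAW = """\
Date: Thu, 28 Jun 2001 04:04:57 -0700 (PDT)
From: j.kaminski@enron.com
To: a@example.com, b@example.com
Cc: c@example.com
Subject: RE: Thu evening

Body line 1
Body line 2
"""


def parse_email(raw):
    """Parse one raw email into a flat dict (illustrative helper only)."""
    msg = message_from_string(raw)

    # Collect to, cc and bcc addresses into a single list of
    # (address, type) pairs so they can all be counted as recipients.
    recipients = []
    for header in ("To", "Cc", "Bcc"):
        for _name, addr in getaddresses(msg.get_all(header, [])):
            if addr:
                recipients.append((addr.lower(), header.lower()))

    sent = parsedate_to_datetime(msg["Date"])  # timezone-aware datetime
    return {
        "sender": msg["From"].strip().lower(),
        "subject": msg["Subject"] or "",
        "datetime": sent,
        "ts": int(sent.timestamp()),  # seconds since the Unix epoch
        "recipients": recipients,
        "num_lines_in_msg": len(msg.get_payload().splitlines()),
    }


parsed = parse_email(RAW)
print(parsed["ts"], parsed["recipients"])
```

Keeping the recipient type next to each address costs nothing at parse time and leaves the decision to lump to, cc and bcc together to the analysis step.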



In [1]:

    
from IPython.display import display
import pandas as pd
from enrondatahandling import EnronEmailDataset

Basic Setup

Having defined the basic classes that will handle the data and parsing for us, we can now start to load and parse our data. The two main tables, aka dataframes, are shown below (limited to the top 5 rows in each case).



In [2]:

    
# Load and parse the enron email dataset
enronData = EnronEmailDataset('./data')









    



Surveyed 1702 email files
Parsed 1702 emails
Found 83 responses



In [3]:

    
# Let's take a look at the emails table
enronData.emails.head()









    Out[3]:

                     email_id             datetime                   ts          tz                        sender                 num_tos  num_ccs  num_bccs  num_recipients  subject                            num_lines_in_msg
email_id
./data/4/54650.txt   ./data/4/54650.txt   2001-06-28 04:04:57-07:00  993726297   tzoffset(u'PDT', -25200)  j.kaminski@enron.com   1        0        0         1               RE: Thu evening                    78
./data/6/173776.txt  ./data/6/173776.txt  2000-07-18 06:49:00-07:00  963928140   tzoffset(u'PDT', -25200)  steven.kean@enron.com  1        0        0         1               Re: Price Cap Media--DRAFT         81
./data/1/138102.txt  ./data/1/138102.txt  2001-11-14 08:35:46-08:00  1005755746  tzoffset(u'PST', -28800)  john.shelk@enron.com   1        1        0         2               RE: Dynegy/Enron Point of Contact  51
./data/1/173413.txt  ./data/1/173413.txt  2000-02-20 09:53:00-08:00  951069180   tzoffset(u'PST', -28800)  steven.kean@enron.com  1        2        0         3               Re: Trade Mission                  315
./data/1/219048.txt  ./data/1/219048.txt  2001-08-10 15:40:25-07:00  997483225   tzoffset(u'PDT', -25200)  ray.alvarez@enron.com  2        2        0         4               CONFIDENTIAL Attached file         15



In [4]:

    
# The recipients table is being maintained separately so as to not keep lists as values in the dataframe
enronData.recipients.head()









    Out[4]:

   email_id            recipient                 type
0  ./data/1/10425.txt  kenneth.lay@enron.com     to
1  ./data/1/10425.txt  mark.frevert@enron.com    to
2  ./data/1/10425.txt  jeff.skilling@enron.com   to
3  ./data/1/10425.txt  mark.schroeder@enron.com  to
4  ./data/1/10425.txt  joseph.sutton@enron.com   to

Basic analysis

Let's now do some basic analysis to see how we can use this data and play with it to get some insights and information of value.

Note: In both questions below, "recipients" means everyone on the to, cc and bcc lists.

Question 1

In the next couple sections I am trying to answer the following question:

Let's label an email as "direct" if there is exactly one recipient and "broadcast" if it has multiple recipients. Identify the top 3 people who received the largest number of direct emails and the person (or people) who sent the largest number of broadcast emails.



In [5]:

    
# Keep only emails with exactly one recipient and attach that recipient's row
directs = pd.merge(
    enronData.recipients,
    enronData.emails[enronData.emails['num_recipients'] == 1],
    left_on='email_id',
    right_index=True)[['ts', 'recipient']]
# Count direct emails per recipient, largest count first
directs = (
    directs.groupby('recipient')
    .count()
    .rename(columns={'ts': 'count_direct'})
    .sort_values(by='count_direct', ascending=False))
directs.head()









    Out[5]:

                             count_direct
recipient
maureen.mcvicker@enron.com   115
vkaminski@aol.com            43
jeff.dasovich@enron.com      25
richard.shapiro@enron.com    23
elizabeth.linnell@enron.com  18



In [6]:

    
# Keep only emails with more than one recipient
broadcasts = enronData.emails[enronData.emails['num_recipients'] > 1][['sender', 'ts']]
# Count broadcast emails per sender, largest count first
broadcasts = (
    broadcasts.groupby('sender')
    .count()
    .rename(columns={'ts': 'count_broadcast'})
    .sort_values(by='count_broadcast', ascending=False))
broadcasts.head()









    Out[6]:

                         count_broadcast
sender
steven.kean@enron.com    252
john.shelk@enron.com     83
j.kaminski@enron.com     40
miyung.buster@enron.com  31
alan.comnes@enron.com    19

Answer 1

Based on the outputs above, we can say:

The top three people who received the largest number of direct emails are:
1. Maureen McVicker (maureen.mcvicker@enron.com), with 115
2. V Kaminski (vkaminski@aol.com), with 43
3. Jeff Dasovich (jeff.dasovich@enron.com), with 25
The person who sent the largest number of broadcast emails is Steven Kean (steven.kean@enron.com), with 252.
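
The question asks for "the person (or people)", and reading the first row of a sorted table would hide a tie at the top. A minimal sketch of a tie-aware lookup is shown here on made-up toy data; the `emails` frame below only mimics the `sender` and `num_recipients` columns of `enronData.emails` and is not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for enronData.emails (invented values)
emails = pd.DataFrame({
    "sender": ["a@x.com", "a@x.com", "b@x.com", "b@x.com", "c@x.com"],
    "num_recipients": [2, 3, 4, 2, 1],
})

# Number of broadcast emails (more than one recipient) per sender
broadcast_counts = (
    emails.loc[emails["num_recipients"] > 1, "sender"]
    .value_counts()
)

# Keep every sender whose count equals the maximum, so ties are reported
top_senders = broadcast_counts[broadcast_counts == broadcast_counts.max()]
print(sorted(top_senders.index))
```

In the real data the gap between the first and second sender (252 versus 83) means there is no tie, so the answer above is unaffected.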

Question 2

In this section I am trying to answer the following question:

Find the five emails with the fastest response times. Please include file IDs, subject, sender, recipient, and response times. (A response is defined as a message from one of the recipients to the original sender whose subject line contains all of the words from the subject of the original email, and the response time should be measured as the difference between when the original email was sent and when the response was sent.)
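
The response definition above can be sketched with two pandas merges. This is only an illustration on made-up toy data, not necessarily how `EnronEmailDataset` builds its `responses` table: the helper `words` and the toy frames are assumptions, and matching is done on lowercased, whitespace-separated words.

```python
import pandas as pd

# Toy data (invented): alice writes to bob and carol, bob replies on topic,
# carol writes back about something else.
emails = pd.DataFrame({
    "email_id": ["e1", "e2", "e3"],
    "sender": ["alice@x.com", "bob@x.com", "carol@x.com"],
    "subject": ["Budget review", "RE: Budget review", "RE: lunch"],
    "ts": [1000, 1148, 1200],
})
recipients = pd.DataFrame({
    "email_id": ["e1", "e1", "e2", "e3"],
    "recipient": ["bob@x.com", "carol@x.com", "alice@x.com", "alice@x.com"],
})

# One row per (email, recipient) pair
pairs = emails.merge(recipients, on="email_id")

# Candidate replies: an email sent by the original recipient back to the
# original sender
replies = pairs.rename(columns=lambda c: c + "_response")
candidates = pairs.merge(
    replies,
    left_on=["sender", "recipient"],
    right_on=["recipient_response", "sender_response"],
)


def words(subject):
    """Lowercased set of whitespace-separated words in a subject line."""
    return set(subject.lower().split())


# A reply must be sent later and contain every word of the original subject
is_later = candidates["ts_response"] > candidates["ts"]
has_words = pd.Series(
    [words(o) <= words(r)
     for o, r in zip(candidates["subject"], candidates["subject_response"])],
    index=candidates.index,
    dtype=bool,
)
responses = candidates[is_later & has_words].copy()
responses["response_time_in_secs"] = responses["ts_response"] - responses["ts"]

# Keep only the earliest reply to each original email
responses = (
    responses.sort_values("response_time_in_secs")
    .drop_duplicates(subset="email_id")
)
print(responses[["email_id", "email_id_response", "response_time_in_secs"]])
```

Note that under this definition a message with an identical subject (for example a forwarded "FW:" message sent back to the original sender) also counts as a response.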



In [7]:

    
responses = enronData.responses.sort_values(by='response_time_in_secs').reset_index()
responses = responses[[
        'email_id', 
        'sender',
        'subject', 
        'email_id_response', 
        'sender_response',
        'subject_response', 
        'response_time_in_secs']]
responses.head()









    Out[7]:

   email_id             sender                   subject                                             email_id_response    sender_response             subject_response                                    response_time_in_secs
0  ./data/1/139495.txt  rod.hayslett@enron.com   FW: Confidential - GSS Organization Value to ETS    ./data/1/151121.txt  stanley.horton@enron.com    FW: Confidential - GSS Organization Value to ETS    148
1  ./data/1/228996.txt  michelle.cash@enron.com  RE: CONFIDENTIAL Personnel issue                    ./data/4/228911.txt  lizzette.palmer@enron.com   RE: CONFIDENTIAL Personnel issue                    236
2  ./data/4/122923.txt  paul.kaufman@enron.com   RE: Eeegads...                                      ./data/3/122926.txt  jeff.dasovich@enron.com     RE: Eeegads...                                      240
3  ./data/1/121747.txt  karen.denne@enron.com    Re: CONFIDENTIAL - Residential in CA                ./data/3/121748.txt  jeff.dasovich@enron.com     Re: CONFIDENTIAL - Residential in CA                240
4  ./data/1/201878.txt  m..tholt@enron.com       FW: SRP SETTLEMENT PROPOSAL - PRIVILEGED AND C...   ./data/4/200845.txt  stephanie.miller@enron.com  FW: SRP SETTLEMENT PROPOSAL - PRIVILEGED AND C...   262

Answer 2

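Based on the outputs above, we can say that the five emails with the fastest response times, in order, are:

1. data/1/139495.txt sent by rod.hayslett@enron.com regarding "FW: Confidential - GSS Organization Value to ETS"; answered by stanley.horton@enron.com (data/1/151121.txt) after 148 seconds
2. data/1/228996.txt sent by michelle.cash@enron.com regarding "RE: CONFIDENTIAL Personnel issue"; answered by lizzette.palmer@enron.com (data/4/228911.txt) after 236 seconds
3. data/4/122923.txt sent by paul.kaufman@enron.com regarding "RE: Eeegads..."; answered by jeff.dasovich@enron.com (data/3/122926.txt) after 240 seconds
4. data/1/121747.txt sent by karen.denne@enron.com regarding "Re: CONFIDENTIAL - Residential in CA"; answered by jeff.dasovich@enron.com (data/3/121748.txt) after 240 seconds
5. data/1/201878.txt sent by m..tholt@enron.com regarding "FW: SRP SETTLEMENT PROPOSAL - PRIVILEGED AND C..."; answered by stephanie.miller@enron.com (data/4/200845.txt) after 262 seconds

The third and fourth entries are tied at 240 seconds.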


